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Abstract 

The interest in variable selection for clustering has increased recently due 
to the growing need in clustering high-dimensional data. Variable selection 
allows in particular to ease both the clustering and the interpretation of the 
results. Existing approaches have demonstrated the efficiency of variable se- 
lection for clustering but turn out to be either very time consuming or not 
sparse enough in high-dimensional spaces. This work proposes to perform a 
selection of the discriminative variables by introducing sparsity in the loading 
matrix of the Fisher-EM algorithm. This clustering method has been recently 
proposed for the simultaneous visualization and clustering of high-dimensional 
data. It is based on a latent mixture model which fits the data into a low- 
dimensional discriminative subspace. Three different approaches are proposed 
in this work to introduce sparsity in the orientation matrix of the discrimina- 
tive subspace through ^i-type penalizations. Experimental comparisons with 
existing approaches on simulated and real-world data sets demonstrate the 
interest of the proposed methodology. An application to the segmentation of 
hyperspectral images of the planet Mars is also presented. 



1 Introduction 

With the exponential growth of measurement capacities, the observed data are nowa- 
days frequently high- dimensional and clustering such data remains a challenging 
problem. In particular, when considering the mixture model context, the corre- 
sponding clustering methods show a disappointing behavior in high- dimensional 
spaces. They suffer from the well-known curse of dimensionality [5] which is 
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mainly due to the fact that model-based clustering methods are dramatically over- 
parametrized in high-dimensional spaces. Moreover, even though we dispose of many 
variables to describe the studied phenomenon, most of the time, only a small subset 
of these original variables are in fact relevant. 

Several recent works have been interested to simultaneously cluster data and 
reduce their dimensionality by selecting relevant variables for the clustering task. A 
common assumption to these works is that the true underlying clusters are assumed 
to differ only with respect to some of the original features. The clustering task 
aims therefore to group the data on a subset of relevant features. This presents 
two practical advantages: clustering results should be improved by the removing of 
non informative features and the interpretation of the obtained clusters should be 
eased by the meaning of retained variables. In the literature, variable selection for 
clustering is handled in two different ways. 

On the one hand, some authors such as [El [201 EH EH] tackle the problem 
of variable selection for model-based clustering within a Bayesian framework. In 
particular, the determination of the role of each variable is recast as a model selection 
problem. A first framework was proposed by Raftery and Dean [2S] in which two 
kinds of subsets of variables are defined: a subset of relevant variables and a subset 
of irrelevant variables which are independent from the clustering but which can be 
explained from the relevant variables through a linear regression. An extension of 
the previous work has then been proposed by Maugis et al. [21] who consider two 
kinds of irrelevant variables: the ones which can be explained by a linear regression 
from a subset of the clustering variables and finally a set of irrelevant variables which 
are totally independent of all the relevant variables. The models in competition are 
afterward compared with the integrated log-likelihood via a BIC approximation. 
Even though these approaches present good results in most practical situations, 
their computational times are nevertheless very high and can lead to an intractable 
procedure in the case of high- dimensional data. 

On the other hand, penalized clustering criteria have also been proposed to deal 
with the problem of variable selection in clustering. In the Gaussian mixture model 
context, several works, such as [2Z1 E21 ESI ES] in particular, introduced a penalty 
term in the log-likelihood function in order to yield sparsity in the features. The 
penalty function can take different forms according to the constraints imposed on 
the structure of the covariance matrices. The introduction of a penalty term in the 
log-likelihood function was also used in the mixture of factor analyzers approaches, 
such as in [TBI EE]- More recently, Witten and Tibshirani [33] proposed a general 
non-probabilistic framework for variable selection in clustering, based on a general 
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penalized criterion, which governs both variable selection and clustering. It ap- 
pears nevertheless that the results of such procedures are usually not sparse enough 
and select a large number of the original variables, especially in the case of high- 
dimensional data. 

Other approaches focus on simultaneously clustering the data and reducing their 
dimensionality by feature extraction rather than feature selection. We can cite in 
particular, the subspace clustering methods [HJ [T71 [2U E31 [2HJ EI] which are based 
on probabilistic frameworks and model each group in a specific and low-dimensional 
subspace. Even though these methods are very efficient in practice, they present 
nevertheless several limitations regarding the understanding and the interpretation 
of the clusters. Indeed, in most of subspace clustering approaches, each group is 
modeled in its specific subspace which makes difficult a global visualization of the 
clustered data. Even though some approaches 0, model the data in a com- 
mon and low- dimensional subspace, they choose the projection matrix such as the 
variance of the projected data is maximum and this can not be sufficient to catch 
discriminative information about the group structure. 

To overcome these limitations, Bouveyron and Brunet [5J recently proposed a 
new statistical framework which aims to simultaneously cluster the data and pro- 
duce a low-dimensional and discriminative representation of the clustered data. The 
resulting clustering method, named the Fisher-EM algorithm, clusters the data into 
a common latent subspace of low dimensionality which best discriminates the groups 
according to the current fuzzy partition of the data. It is based on an EM proce- 
dure from which an additional step, named F-step, is introduced to estimate the 
projection matrix whose columns span the discriminative latent space. This pro- 
jection matrix is estimated at each iteration by maximizing a constrained Fisher's 
criterion conditionally to the current soft partition of the data. As reported in [5J, 
the Fisher-EM algorithm turned out to outperform most of the existing clustering 
methods while providing a useful visualization of the clustered data. However, the 
discriminative latent space is defined by "latent variables" which are linear combina- 
tions of the original variables. As a consequence, the interpretation of the resulting 
clusters according to the original variables is usually difficult. An intuitive way to 
avoid such a limitation would be to keep only large loadings variables, by thresh- 
olding for instance. Even though this approach is commonly used in practice, it has 
been particularly criticized by Cadima [TU] since it induces some misleading infor- 
mation. Furthermore, it often happens when dealing with high- dimensional data 
that a large number of noisy or non- informative variables are present in the set of 
the original variables. Since the latent variables are defined by a linear combination 
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of the original ones, the noisy variables may remain in the loadings of the projection 
matrix and this may produce a deterioration of the clustering results. 

To overcome these shortcomings, three different approaches are proposed in this 
work for introducing sparsity in the Fisher-EM algorithm and thus select the dis- 
criminative variables among the set of original variables. The remainder of this 
document is organized as follows. Section 2 reviews the discriminative latent mix- 
ture model of [B] and the Fisher-EM algorithm which was proposed for its inference. 
Section 3 develops three different procedures based on l\ penalties for introducing 
sparsity into the Fisher-EM algorithm. The first approach looks for the best sparse 
approximate of the solution of the F-step of the Fisher-EM algorithm. The second 
one recasts the optimization problem involved of the F-step as a lasso regression-type 
problem. The last approach is based on a penalized singular value decomposition 
(SVD) of the matrix involved in the constrained Fisher criterion of the F-step. 
Numerical experiments are then presented in Section 4 to highlight the practical be- 
havior of the three sparse versions of the Fisher-EM algorithm and to compare them 
to existing approaches. In section 5, a sparse version of the Fisher-EM algorithm 
is applied to the segmentation of hyperspectral images. Section 6 finally provides 
some concluding remarks and ideas for further works. 

2 The DLM model and the Fisher-EM algorithm 

In this section, we briefly review the discriminative latent mixture (DLM) model [B] 
and its inference algorithm, named the Fisher-EM algorithm, which models and 
clusters the data into a common latent subspace. Conversely to similar approaches, 
such as [SI EH E51 ESI EZ], this latent subspace is assumed to be discriminative and 
its intrinsic dimension is strictly bounded by the number of groups. 

2.1 The DLM model 

Let {yi, . . . ,y n } G MP denote a dataset of n observations that one wants to clus- 
ter into K homogeneous groups, i.e. adjoin to each observation yi a value Z{ G 
{1, . . . , K} where Zi = k indicates that the observation yi belongs to the kth group. 
On the one hand, let us assume that {yi, ■ ■ ■ ,y n } are independent observed real- 
izations of a random vector Y G M p and that {zi,...,z n } are also independent 
realizations of a random variable Z G {1, . . . , K}. On the other hand, let EcM f 
denote a latent space assumed to be the most discriminative subspace of dimension 
d < K — 1 such that G E and K < p. Moreover, let {x\, . . . , x n } G E denote the 
actual data, described in the latent space E of dimension d, which are in addition 
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Figure 1: Graphical summary of the DLMp fci gi model. 

presumed to be independent realizations of an unobserved random vector IgE. 
Finally, the observed variable Y G MP and the latent variable X G E are assumed 
to be linked through a linear transformation: 

Y = UX + e, (1) 

where U is a p x d orthonormal matrix common to the K groups and satisfying 
U l U = Id- The p-dimensional random vector e stands for the noise term which 
models the non discriminative information and which is assumed to be distributed 
according to a centered Gaussian density function with a covariance matrix (e ~ 
Af(0, Besides, within the latent space, X is assumed, conditionally to Z = k, 
to be Gaussian : 

X lz=k ~ JV(// fe , E fc ) (2) 

where //& G M d and G R dxd are respectively the mean vector and the covariance 
matrix of the kth group. Given these distribution assumptions and according to 
equation ([T]), 

Y\ Xt z=k~M(UX,ilf), (3) 
and its marginal distribution is therefore a mixture of Gaussians: 

K 

f(y) = Yl ^(y; m k , s k ), (4) 

k=l 

where 7fk is the mixing proportion of the kth group and </)(.; m^, Sk) denotes the 
multivariate Gaussian density function parametrized by the mean vector = U ^ 
and the covariance matrix Sk = UY>kU l + ^ of the fcth group. Furthermore, we define 
the p x p matrix W = [U, V] such that W l W = WW t = I p , where the (p — d) x p 
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matrix V is an orthogonal complement of U . Finally, the noise covariance matrix 
\I/ is assumed to satisfy the conditions V^/V 1 = ftl p -a and U^U 1 = Od, such that 
Afc = W t SkW has the following form: 



V 











ft 
o p 



\ 



d < K- 1 



/ 



(p-d) 



These last conditions imply that the discriminative and the non-discriminative sub- 
spaces are orthogonal, which suggests in practice that all the relevant clustering 
information remains in the latent subspace. This model is referred to by DLMp^] 
in [BJ and a graphical summary is given in Figure [TJ 



2.2 A family of parsimonious models 

Several other models can be obtained from the DLMp fci g] model by relaxing or adding 
constraints on model parameters. Firstly, it is possible to consider a more general 
case than the DLMp fc /3] by relaxing the constraint on the variance term of the non 
discriminative information. Assuming that e\z=k ~ A/"(0, ^k) yields the DLMr^^i 
model which can be useful in some practical cases. From this extended model, 10 
parsimonious models can be obtained by constraining the parameters and ft^ to 
be common between and within the groups. For instance, the covariance matrices 
Ei, ... , E# in the latent space can be assumed to be common across the groups 
and this sub-model is referred to by DLMrj^j. Similarly, in each group, E& can be 
assumed to be diagonal, i.e. E fc = diag(c%L, . . . , a^)- This sub-model is referred 
to by DLM[ Qfej/ 3 fc ]- These sub-models can also be declined by considering that the 
parameter ft is common to all classes (Vfc, ft^ — ft). A list of the 12 different DLM 
models is given by Table [TJ and detailed descriptions can be found in [B]. Such a 
family yields very parsimonious models and allows, in the same time, to fit into 
various situations. In particular, the complexity of the DLMp fe/ g fe ] model mainly 
depends on the number of clusters K since the dimensionality of the discriminative 
subspace is such that d < K — 1. Notice that the complexity of the DLMp fci g fc ] grows 
linearly with p contrary to the traditional Gaussian models in which the complexity 
increases with p 2 . As an illustration, if we consider the case where p = 100, K = 4 
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Model 


Nb. 


of parameters 










if = 4 and 
p= 100 


DLM [S)A] 


(K 


-1)- 


VK(K~- 1) 4 


-{K- l)(p-K/2)- 


h K 2 (K - 


-l)/2- 


fif 


337 






(K 


-1)- 


I- K(K - 1) -f 


-{K-l)[p-K/2)- 


\- K 2 (K - 


-l)/2- 


f 1 


334 




DLM[ S(3fc ] 


(K 


-1)- 


I- K(K - 1) -f 


-(K-l){p-K/2)- 


f K(K - 


l)/2 + 


K 


319 




DLM [S(3] 


(K 


-1)- 


I- K(K - 1) 4 


-{K-l)(j>-K/2)- 


VK(K - 


l)/2 + 


1 


316 




DLM [Qfcj&] 


(K 


-1)- 


VK{K- 1) 4 


-(K-l)(p-K/2)- 


IK 2 






325 




DLM [Qfcj/3] 


(K 


-1)- 


VK(K- 1) 4 


-{K-l)(p-K/2)- 


f K(K - 






322 




DLM [Qfcft] 


(K 


-1)- 


I- ^(if - 1) 4 


-{K-l)(p-K/2)- 


h2K 






317 




DLM [a fe /3] 


(K 


-1)- 


h K(K - 1) 4 


- (K -K/2) - 


hK + 1 






314 






(K 


-1)- 


h - 1) 4 


-(K-l)(p-K/2)- 


h(K-l) 


+ K 




316 




DLM [Qj/3] 


(K 


-1)- 


1- if (if - 1) 4 


-(K-l)(p-K/2)- 


f(A--l) 


+ 1 




313 




DLM [a&] 


(K 


-1)- 


VK(K- 1) 4 


- (K -l)(p -K/2) - 


f K + l 






314 




DLM [Q/3] 


(K 


-1)- 


YK{K- 1)4 


-{K- l){p-K/2)- 


f 2 






311 




Full-GMM 


(K 


-1)- 


- -fTp 4- Kp(p 


+ l)/2 








20603 




Com-GMM 


(K 


-1)- 


- Kp + p(p + 


l)/2 








5453 




Diag-GMM 


(K 


-1)- 


\- Kp + Kp 










803 




Sphe-GMM 


(K 


-1)- 


VKp + K 










407 




MFA 


(K 


-1)- 


V Kp + Kd[p 


-{d-l)/2)]+Kp 








1991 (d = 


3) 


Mixt-PPCA 


(K 


-1)- 


h Kp + K[d(p ~{d + l)/2) 4- d + 


1] + 1 






1198 (d = 


3) 


PGMM-CUU 


(K 


-1)- 


VKp + d\p - 


(d+l)/2]+Kp 








1100 (d = 


3) 


MCFA 


(K 


-1)- 


h Kd + p + d 


p- (d+ l)/2] 4- Kd(d+ l)/2 






433 (d = 


3) 


MCUFSA 


(K 


-1)- 


\- Kd + \ + d 


p- (d+l)/2]+Kd 






322 (d = 


3) 



Table 1: Number of free parameters to estimate when d = K — 1 for the DLM 
models and some classical models (see text for details). 



and d = 3, then the number of parameters to estimate for the DLMp^j is 337 
which is drastically less than in the case of the Full-GMM (20 603 parameters to 
estimate). For a comparison purpose, Table [T] presents also the complexity of other 
clustering methods, such as Mixt-PPCA [3T], MFA [23], PGMM |23|, MCFA [1] and 
MCUFSA |38j for which the complexity grows linearly with p as well. 

2.3 The Fisher-EM algorithm 

An estimation procedure, called the Fisher-EM algorithm, is also proposed in [B] in 
order to estimate both the discriminative space and the parameters of the mixture 
model. This algorithm is based on the EM algorithm from which an additional step is 
introduced, between the E and the M-step. This additional step, named F-step, aims 
to compute the projection matrix U whose columns span the discriminative latent 
space. The Fisher-EM algorithm has therefore the following form, at iteration q: 
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The E-step This step computes the posterior probabilities t\f that the observa- 
tions belong to the K groups using the following update formula: 

4? = *tMvJt 1} )/f:*tMvijt\ (5) 

with 9 k = {/xfe,Sfc, /§*.,£/}■ 

The F-step This step estimates, conditionally to the posterior probabilities, the 
orientation matrix of the discriminative latent space by maximizing the Fisher's 
criterion [T3| [T5] under orthonormality constraints: 



U {q) = max trace ((U t SU)- 1 U t S% ) U) , 

w.r.t. U t U = I dl (6) 



where S stands for the covariance matrix of the whole dataset and S§ , defined as 
follows: 

Sf = ~j:n^\mf-y){m^-y)\ (7) 
n fe=i 

denotes the soft between covariance matrix with n% = Z)£=i Hk \ m k^ = l/ n fc^ S"=i tit Hi 
and y = l/nX)^=i Vi- This optimization problem is solved in [6] using the concept of 
orthonormal discriminant vector developed by [H] through a Gram-Schmidt proce- 
dure. Such a process enables to fit a discriminative and low- dimensional subspace 
conditionally to the current soft partition of the data while providing orthonormal 
discriminative axes. In addition, according to the rank of the matrix Sg , the di- 
mensionality of the discriminative space d is strictly bounded by the number of 
clusters K. 

The M-step This third step estimates the parameters of the mixture model in 
the latent subspace by maximizing the conditional expectation of the complete log- 
likelihood: 

Q(°) = -^E4 9) [-21og(^) +trace(^ 1 ^c£ 9) ^) +log(|E fc |) 
1 fc=i 

+ (p-d) log(p k ) + k 3 k —±- + plog(2vr) . (8) 

where C% = -^y Yh=i 4a? (v* ~ m k^){yi ~ m i^)* is the empirical covariance matrix 
of the fcth group and Uj is the jth column vector of U^ q \ = Y^=\t% ■ Hence, 
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maximizing Q conditionally to leads to the following update formula for the 
mixture parameters of the model DLMp fc/ g fc i: 




n 



k 



(9) 



n 




1 71 ) 
< i=i 



(10) 



u {q)t c k u {q \ 

trace(C fc )-E; = i ftgc^ 



(11) 



A' 



(?) 



p — d 



(12) 



The Fisher-EM procedure iteratively updates the parameters until the Aitken cri- 
terion is satisfied (see paragraph 4.5 of [6J). The convergence properties of the 
Fisher-EM algorithm have been studied in J7j. It is also proposed in this work 
to use a stopping criterion based on the Fisher criterion involved in the F-step to 
improve the clustering performance. Finally, since the latent subspace has a low 
dimension and common to all groups, the clustered data can be easily visualized by 
projecting them into the estimated latent subspace. 



3 Sparse versions of the Fisher-EM algorithm 



Even though the Fisher-EM algorithm turns out to be efficient both for modeling 
and clustering data, the interpretation of clustering results regarding the original 
variables remains difficult. In this section, we propose therefore three different ways 
to introduce sparsity into the loadings of the projection matrix estimated in the 
F-step of the Fisher-EM algorithm. 

3.1 A two-step approach 

In this first approach, we propose to proceed in two steps. First, at iteration q, 
the traditional F-step of the Fisher-EM algorithm computes an estimate of the 
orientation matrix of the discriminative latent space conditionally to the posterior 
probabilities t^}. Then, the matrix is approximated by a sparse one using 
the following result. 

Proposition 3.1. The best sparse approximation of at the level A is the 
solution of the following penalized regression problem: 



min X {q)t - 



u 



3^1* + A E 1*4 
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where U = [Ui, ...,Ud], Uj G MP is the jth column vector ofU, \\.\\ F is the Frobenius 
norm and X® = U {q)t Y . 

Proof. Let be the orientation matrix of the discriminative latent space estimated 
by the F-step at iteration (g) and let us define X^ = JJ^Y G M. dxn the matrix of 
the projected data into the subspace spanned by U^ q \ where Y G W xn denotes the 
original data matrix. Since X^ is generated by U^ q \ then is solution of the 
least square regression of X^ q ' on Y: 



mm 

u 



- Y l U 2 



where U = \Ui, ■■■,Ud], Uj G W is the jth column vector of U, \\.\\ F is the Frobenius 
norm. A penalized version of this regression problem can be obtained by adding a 
£i-penalty term as follows: 



mm 

u 



x^-Ym+xitm,, 

3=1 



and the solution of this penalized regression problem is therefore the best sparse 
approximation of U^ qS> at the level A. □ 

The previous result allows to provide a sparse approximation of U™' but we 
have no guarantee that the is orthogonal as required by the DLM model. The 
following proposition solves this issue. 

Proposition 3.2. The best orthogonal approximation of is = 

where and v^ q ' are respectively the left and right singular vectors of the SVD of 

Proof. Let us consider the matrix U^ g >. Searching the best orthogonal approximation 
of the matrix is equivalent to solving the following optimization problem: 



mm 

u 



U {q) - U w.r.t. U l U = I c 

F 



This problem is a nearest orthogonal Procrustes problem which can be solved by a 
singular value decomposition [T5]. Let u^A^v^ be the singular value decomposi- 
tion of U( q \ then 

u (q) v (<i)t i s 

the best orthogonal approximation of . □ 

From an practical point of view, the penalized regression problem of Proposi- 
tion l3.1l can be solved by alternatively regressing each column vector of the projected 
matrix . The sparse and orthogonal approximation U^ q > of is obtained af- 
terward through a SVD of . The following algorithm summarizes these steps. 
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Algorithm 1 - F-step of the sparseFEM-1 algorithm 



1. At iteration q, compute the matrix by solving 

2. Compute X® = U (q)t Y. 

3. For j G {1, . . . , d}, solve d independent penalized regression problems with the 
LARS algorithm [T2]: 



Uf ] = argmin xf z - Y% ' + X \U j \ l , 

4. Repeat step 3 several times until convergence. 

5. Let Cfa) = [U[ q \...,U^\ compute the SVD of = m^A^V^ and let 
jj{q) = u (<i) v {<i)t_ 



Si) 1 vh 



2 



Let us remark that this problem can be extended to a more general penalized 
regression by adding a ridge penalty term. This allows in particular to handle 
the n < p case which occurs frequently nowadays. In such a case, the elastic-net 
algorithm [UJ has to be used instead of the LARS algorithm in Algorithm 1. 

Nevertheless, a limitation of such a procedure may be the disconnection between 
the estimation of the discriminative subspace and the introduction of the sparsity in 
the loadings of the projection matrix. To avoid that, the two following approaches 
aim to propose penalized Fisher criteria for which the solutions fit directly a sparse 
and discriminative latent subspace. 



3.2 A penalized regression criterion 

We therefore propose here to reformulate the constrained Fisher criterion in- 
volved in the F-step of the Fisher-EM algorithm as a penalized regression problem. 
Consequently, the solution of this penalized regression problem will fit directly a 
sparse and discriminative latent subspace. To this end, let us introduce the soft 
matrices H^) and which will be computed, conditionally to the E-step, at each 
iteration q of the sparse F-step as follows: 

Definition 3.1. The soft matrices H$ G W xn and H$ G R pxK are defined, 
conditionally to the posterior probabilities t\f computed in the E-step at iteration q, 
as follows: 
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H 



(?) 

w 



K 



Y hk '"i, ■ 

k=l 



K 



k=l 



{q) m\ q) 



H 



(?) 



nS 9) (m! 9) 



y),. 



(?)/_(?) 



y) 



e M pxn 
g 



ppxii" 



(13) 
(14) 



where — J27=i t-ik an d m k^ = n ^?=i ^ikVi zs so ft mean vector of the cluster k. 



,(?) 



1 v^ra 



According to these definitions, the matrices Hy) and H^ satisfy: 



r(?) 



H$H$ t = S® and 



q(?) 



(15) 



where = l/nj^^i ^jt Cfc stands for the soft within covariance matrix com- 
puted at iteration g and denotes the soft between covariance matrix defined in 
equation ([7j). A penalized version of the optimization problem (jH]) can be therefore 
formulated penalized regression-type problem: 

Proposition 3.3. The best sparse approximation £/w of the solution of (|7jJ) at the 

level A is the solution B^ of the following penalized regression problem: 



K 



mm 

A,B 



in £ - A&H® + p £ + A £ 



^?(?)-* u(?) 

fc=i 

w.r.t A* A = I 



i=i 



i=i 



where A = [a u . . . , a d ) G ^ xd ; £? = [ft, . . . , ft] G R pxd , R$ 
triangular matrix resulting from the Cholesky decomposition of Sy) , i.e. Sy/ = 
R^R^, H^ k is the kth column of H$ and p > 0. 

Proof. First, let us consider that the matrix A is fixed at iteration q . Then, opti- 
mizing : 



ppxp 



is a upper 



r^\\ R w- tH Bi- ABt H$k 
a ' b k=i 



(16) 



conditionally to A leads to consider the following regularized regression problem: 



mm£ {Hf'R^a, - flj?*/^ + P^fr 



with 5 = [ft, ... , (3d]. Solving this problem is equivalent to solving d independent 
ridge regression problem and the solution B^ is : 



6® = (S%> + pS^y 1 'R^ 1 A. 



(17) 
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By substituting B^ in Equation ffTEj) . optimizing the objective function ffTB]) over 
A, given A* A = 1^ and B^ fixed, is equivalent to maximize the quantity: 



max trace (B^H^H^R^A 



w 



r.t. A 1 A = L 



According to Lemma 1 of [28J, this is a Procrustes problem [18J which has an ana- 
lytical solution by computing the singular value decomposition of the quantity: 

where the column vectors of the p x d matrix are orthogonal and is a d x d 
orthogonal matrix. The solution is A^ = Substituting A® into flT7]) gives: 

By remarking that the <i eigenvectors associated to the non-zero eigenvalues of the 
generalized eigenvalue problem §6§ are the columns of i?^ u^, it follows that JB^ 
spans the same linear subspace than the solution of Therefore, the solution 
of the penalized optimization problem: 



min f; H^-^i - AB'H®, \ + p £ + A £ 



rr{q) _ A r>t Zjil) 
l W U B,k AD n B,k 
k=l j=l 3=1 

w.r.t. A 1 A = l d , 



is the best sparse approximation of the solution of at the level A. □ 

However and as in the previous case, the orthogonality of the column vectors 
of B^ is not guaranteed but this issue can be tackled by Proposition 13. 21 From a 
practical point of view, the optimization problem of Proposition 13.31 can be solved 
using the algorithm proposed by in the supervised case by optimizing alterna- 
tively over B with A fixed and over A with B fixed. This leads to the following 
algorithm in our case: 
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Algorithm 2 - F-step of the sparseFEM-2 algorithm 



1. At iteration q, compute the matrices H$ and Hy? from Equations f[T3"l) 
and dHJ. Let S$ = H$H$* and S§ } = H {q) H {q)t . 



r>(q)t Tf(q) 
J^w - n, w • 



2. Compute by using a Cholesky decomposition of Sy) + 7/p trace (S^) 



3. Initialization: 



Let B^ be the eigenvectors of S X S 



1 0(9) 



B 



Compute the SVD R^S^B® = u^A^v^* and let A® = n^v®*. 
4. Solve d independent penalized regression problems. For j = 1, . . . ,d: 

= argmin (^W^W^Pj - 2Y {q)t W {q) (3 3 + Ai \\^ 



I H {q)t \ ~ ( H"(9)t p(9)-l (9) 

where W® = (a) and Y® =[ H b H w 



5. Let = [ft, . . . , Compute the SVD of 0$'* S$ = u^A^v^ 1 and 
let = u^v^. 

6. Compute the SVD of B® = u'^A'^v'^ and let = u'^v'^K 

7. Repeat steps several times until convergence. 



3.3 A penalized singular value decomposition 

In this last approach, we reformulate the constrained Fisher criterion (jH]) involved in 
the F-step of the Fisher-EM algorithm as a regression problem which can be solved 
by doing the SVD of the matrix of interest in this regression problem. A sparse 
approximation of the solution of this regression problem will be obtained by doing 
a penalized SVD [54] instead of the SVD. To that end, let us consider the following 
result. 

Proposition 3.4. The solution of (JHJ) is also solution of the following constrained 
optimization problem: 

min£l^-^S|r 
u 1=1 

w.r.t. U l U = Li, 

where is the £th column of the soft between covariance matrix S$ computed at 
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iteration q. 

Proof. Let us first prove that minimizing the quantity Z)?=i W^bi ~ U^S^H 2 is 
equivalent to maximize tmce^S^S^U). To that end, we can write down the 
following equalities: 



£ ||4 9 i - = £ trace - - UU*)S® 

1=1 t=i 

( p 
= trace (I p - C/C/*)*(I P - 1717*) £ S^Jsg 



= trace (S^\l p -UU*) S$) 

= trace^sf) -trace^S^S^U). 

Consequently, minimizing over Z7 the quantity Y%=i 1 — Z7Z7*Sg £ | | 2 is equivalent 
to maximize trace(Z7 i ^ ) ^ )t Z7) according to U. Let us now consider the SVD of 
the n x p matrix S$ = ttAu* where u and t> stands for respectively the left and 
right singular vectors of and A is a diagonal matrix containing its associated 
singular values. Since the matrix S^' has a rank d at most equal to K — 1 < p, with 
if the number of clusters, then only d singular values of the matrix S$ are non 
zeros, which enables us to write S# = uK d v l , where A d = diag(Ai, . . . , A^, 0, . . . , 0). 
Moreover, by letting U = Ud the d first left eigenvectors of S B , then: 

tTUce(U t S B S t B U) = trace {u 1 {uKdV t ){uKdV t ) t U 
= trace (Z7*uA d A^/Z7) , 

d 

= E4 

3=1 

Consequently, the p x d orthogonal matrix U such that Z)?=i H^b^ — ^^^b^II 2 
is minimized, is the matrix made of the d first left eigenvectors of Besides, 
since S$ is symmetric and semi-definite positive, the matrix U contains also the 
eigenvectors associated with the d largest eigenvalues of S^ 2 and therefore the ones 
of . Therefore, assuming without loss of generality that S = I P ,U is also solution 
of the constrained optimization problem involved in the original F-step. □ 

The optimization problem of Proposition (13 .4p can be seen as looking for the 
projection matrix U such that the back-projection l&PSg g is as close as possible 
to Sg\. In |34j . Witten et al. have considered such a problem with a constraint of 
sparsity on U. To solve this problem, they proposed an algorithm which performs 
a penalized SVD of the matrix of interest in the constrained optimization problem. 
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Therefore, it is possible to obtain a sparse approximation of the solution of §6§ 
by doing a penalized SVD of with the algorithm of [2]. As previously, the 
orthogonality of the column vectors of is not guaranteed but this issue can be 
again tackled by Proposition E2J From a practical point of view, this third approach 
can be implemented as follows: 

Algorithm 3 - F-step of the sparseFEM-3 algorithm 

1. Let M x = Sf and d = rank (S B ) ■ 

2. For j G {1,...,4: 

(a) Solve Uj = argmax Uj . UjMjVj w.r.t. \\uj\\^ < 1, \\vj\\^ < 1 and 



Yfi=i u )e < Ai using the penalized SVD algorithm of 



(b) Update M j+1 = Mj - \jufv) 



3. m = [u?,...,u^]. 

4. Compute the SVD of = u^A^v^ and let = u^v^K 



3.4 Practical aspects 

The introduction of sparsity in the Fisher-EM algorithm presents several practical 
aspects among which the ability to interpret the discriminative axes. However, two 
questions remain: the choice of the hyper-parameter which determines the level of 
sparsity and the implementation strategy in the Fisher-EM algorithm. Both aspects 
are discussed below. 

Choice of the tuning parameter The choice of the threshold A is an important 
problem since the number of zeros in the d discriminative axes depends directly 
on the degree of sparsity. In [40], Zou et al. chose the hyper-parameter of their 
sparse PCA with a criterion based on the explanation of the variance approximated 
by the sparse principal components. In [33J, Witten and Tibshirani proposed for 
their sparse-kmeans to base the choice of the tuning parameter on a permutation 
method closely related to the gap statistic previously proposed by Tibshirani et 
al. [2D] for estimating the number of components in standard kmeans. Since our 
model is defined in a Gaussian mixture context, we propose to use the BIC criterion 
to select the threshold A. According to the consistency results obtained by Zou et 
al. |32] and the fact that the sparsity constraint is applied on the projection matrix 
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U, the effective number of parameters to estimate in the DLMp^j model is: 

7 e = (K - 1) + Kd + (d[paL§(d + 1) /2] - d e ) + Kd(d + l)/2 + K 

where d e is the number of zeros in the loading matrix. In the same manner, this 
effective number of parameters to estimate can be declined for the 11 other sub- 
models of the DLM family. 

Implementation of the sparse Fisher-EM algorithm We identified two dif- 
ferent ways to implement the sparse versions of the Fisher-EM algorithm. First, 
it could be possible to replace the usual F-step of the Fisher-EM algorithm by a 
sparse F-step developed previously. The resulting algorithm would sparsify at each 
iteration the projection matrix U before estimating the model parameters. This 
can however leads to some drawbacks since an early introduction of the t\ penalty 
could penalize too much the loadings of the projection matrix, in particular if the 
initialization is far away from the optimal situation. An alternative implementation 
would be to, first, execute the traditional Fisher-EM algorithm until convergence 
and, then, initialize the sparse Fisher-EM algorithm with the result of the Fisher-EM 
algorithm. This strategy should combine the efficiency of the standard Fisher-EM 
algorithm with the advantage of having a sparse selection of discriminative vari- 
ables. We therefore recommend this second implementation and it will be used in 
the experiments presented in the following sections. 

4 Experimental comparison 

This section presents comparisons with existing variable selection techniques on 
simulated and real-world data sets. 

4.1 Comparison on simulated data 

This first experiment aims to compare on simulated data the performances of the 
proposed sparseFEM algorithms (sparseFEM-1, sparseFEM-2, sparseFEM-3) to sev- 
eral competitors: Clustvarsel of Raftery and Dean [29], Selvarclust of Maugis et 
al. [22] and sparse-kmeans of Witten and Tibshirani [33]. For this experiment, 
we replicated the simulation proposed in Section 3.3 of [33j. We simulated K = 3 
Gaussian components of n observations in a 25-dimensional observation space whose 
components differ only on q = 5 features. The used parameters were /ikj = 
H x (lfc=ij< g , — lfc=2,j< g ), VA; G {1,2,3} and Vj G {!,..., p} for the mean com- 
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Simulation 


Method 


Clustering error 


non-zero variables 


n = 30 fj, = 0.6 


kmeans 
sparse-kmeans 
Clustvarsel 
Selvarclust 


0.48 ±0.05 
0.47 ±0.07 
0.62 ±0.06 
0.40 ±0.03* 


25.0 ±0.0 
19.0 ±6.6 
22.2 ±1.2 
8.1 ± 1.9* 


sparseFEM-1 
sparseFEM-2 
sparseFEM-3 


0.47 ±0.06 
0.48 ±0.07 
0.48 ±0.03 


2.6 ±0.9 

4.7 ± 1.8 
2.0 ±0.0 


n = 30 n = 1.7 


kmeans 
sparse-kmeans 
Clustvarsel 
Selvarclust 


0.14 ± 10.2 
0.08 ±0.06 
0.41 ±0.10 
0.08 ±0.08* 


25.0 ±0.0 
23.6 ±0.8 
16.6 ±10.4 
6.8 ± 1.4* 


sparseFEM-1 
sparseFEM-2 
sparseFEM-3 


0.14 ±0.13 
0.20 ±0.12 
0.17 ±0.11 


3.5 ±0.8 
5.4 ±2.2 
2.0 ±0.0 


n = 300 fj, = 0.6 


kmeans 
sparse-kmeans 
Clustvarsel 
Selvarclust 


0.43 ±0.03 
0.46 ±0.03 
0.42 ±0.03 
0.34 ±0.02* 


25.0 ±0.0 
24.0 ±0.5 
25.0 ±0.0 
7.0 ± 1.7* 


sparseFEM-1 
sparseFEM-2 
sparseFEM-3 


0.42 ±0.03 
0.43 ±0.03 
0.43 ± 0.04 


2.4 ± 1.0 

5.2 ±2.7 

2.3 ±1.1 


n = 300 n = 1.7 


kmeans 
sparse-kmeans 
Clustvarsel 
Selvarclust 


0.05 ±0.06 
0.05 ±0.01 
0.05 ±0.01 
0.05 ±0.01* 


25.0 ±0.0 
15.0 ±0.0 
25.0 ±2.0 
5.6 ±0.9* 


sparseFEM-1 
sparseFEM-2 
sparseFEM-3 


0.04 ±0.01 
0.05 ±0.01 
0.04 ±0.01 


10.2 ±2.4 
8.8 ±1.7 
5.6 ± 1.6 



Table 2: Clustering errors and numbers of non-zero variables averaged over 20 sim- 
ulations for several clustering methods with p = 25 and q = 5. The results of 
Selvarclust are reported from [IT) . 
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ponents and crL = 1 for the variance terms. Moreover, four different situations are 
considered: n = 30 or 300 and [i = 0.6 or 1.7. Each simulation was replicated 25 
times. 

Table E] presents the means and standard deviations for both the clustering er- 
ror and the number of non-zero variables for kmeans, sparse-kmeans, Clustvarsel, 
Selvarclust and the 3 procedures of sparseFEM. Note that the results of Selvarclust 
corresponds to clustering errors and non-zero variable rates found in [ITj. Moreover, 
the reported results concerning the 3 sparse Fisher-EM algorithms were obtained 
with the DLM[ Qfe( g] model for a sparsity level corresponding to the highest BIC value 
obtained at each trial. 

Two main remarks can be done on the results presented in Table EJ First, 
by considering either the most difficult clustering cases (n = 30 and \x = 0.6) or 
the easiest one (n = 300 and /i = 0.6 or 1.7), all approaches present approxima- 
tively the same results in terms of clustering error rate. The methods differ how- 
ever in the number of variables they retain to perform the clustering: Clustevarsel, 
sparseFEM- 1, sparseFEM-2 and sparseFEM-3 turn out to select significantly less 
variables than sparse-kmeans and Clustvarsel. In particular, Clustevarsel and the 
sparseFEM algorithms select a number of useful variables consistent with the actual 
number of meaningful variables (q = 5). Second, in the situation where n = 30 
and [1 = 1.7, Selvarclust and sparse-kmeans present the lowest misclassification rate 
(0.08), even though the clustering error of sparseFEM-1 and kmeans remains rel- 
atively low (0.14). However, as previously, only Clustevarsel and the sparseFEM 
algorithms select a number of variables close to the right number of discriminative 
features. 

4.2 Comparison on real data sets 

Real-world data sets are now used to compare the efficiency of the sparseFEM 
algorithms to its competitors for both the clustering and variable selection tasks. 
We considered 7 different benchmark data sets coming mostly from the UCI machine 
learning repository. We selected these data sets because they represent a wide range 
of situations in term of number of observations n, number of variables p and number 
of groups K. These characteristics are given in the top row of Table [3] and a detailed 
description of these data sets can be found in [6] . 

For this experiment, we used the 3 sparseFEM algorithms and the 3 sparse meth- 
ods introduced previously (sparse-kmeans, Clustvarsel and Selvarclust). Since the 
evaluation of the clustering performance is a complex and very discussed problem, 
we chose to evaluate the clustering performance as the adequacy between the re- 
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suiting partition of the data and the known labels for these data. For each data 
set, the sparseFEM algorithms were initialized with a common random partition 
drown from a multinomial distribution with equal prior probabilities. For Clust- 
varsel, Selvarclust and sparse-kmeans, the initialization was done with their own 
deterministic procedure. Moreover, for each method, the number K of groups has 
been fixed to the actual one. For Clustvarsel, Selvarclust and sparse-kmeans, the 
determination of the other free parameters was done according to the tools provided 
by each approach. For the sparseFEM algorithms, we used the penalized BIC cri- 
terion to select the model and the level of sparsity. More precisely, we first chose 
the model presenting the highest average BIC value on 20 replications. Then, given 
the selected model, we selected the level of sparsity associated with the highest BIC 
value. 

Table[3]presents the average clustering accuracies and the associated standard de- 
viations obtained for the 6 approaches. The average number of non-zero variables is 
also reported within brackets in the table. The results associated to the sparseFEM 
algorithms have been obtained by averaging over 20 trials with random initializa- 
tions. The lack of standard deviations for Clustvarsel, Selvarclust and sparse-kmeans 
is due to the deterministic initializations they use. It first appears that the three 
sparse versions of the Fisher-EM algorithm perform rather similarly both in term 
of clustering and variable selection. It also appears clearly that the sparseFEM 
algorithms are competitive to existing methods regarding both the clustering per- 
formances and the selection of variables. Indeed, the sparseFEM algorithms obtain 
the best clustering accuracies on 4 of the 7 data sets whereas sparse-kmeans and 
Selvarclust obtain the best clustering accuracies on respectively 2 and 1 data sets. 
The sparseFEM algorithms differ also from sparse-kmeans regarding the number of 
variables retained to perform the clustering. Indeed, sparse-kmeans turns out to 
frequently select a large number of variables whereas sparseFEM is usually rather 
sparse in the number of selected variables. Finally, Clustvarsel and Selvarclust turn 
out to select most of the time few variables, particularly in high- dimensional spaces, 
which seems to obstruct their clustering performance. To summarize, this experi- 
ment has shown that the sparseFEM algorithms seem to be good compromises be- 
tween sparse-kmeans and Clustvarsel /Selvarclust in term of variable selection and, 
certainly thanks to this characteristic, they also provide good clustering results. 

4.3 Comparison on the usps358 data set 

We focus now on the usps358 dataset to stress the role of variable selection in the 
interpretation of clustering results. The original dataset is made of 7291 images 
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iris 


wine 


chiro 


zoo 


glass 


satimage 


usps358 




(p=4 } K=3) 


(p=13,K=3) 


( p=17jK=3 ) 


(p=16,K=7) 


( p=9jK =7) 


(p=36,K=6) 


(p=256,K=3) 


Approaches 


(n=150) 


(n=178) 


(n=178) 


(n=101) 


(n=214) 


(n=4435) 


(n=1726) 


sparseFEM-1 


96.5±0.3 


97.8±0.2 


84.2±11 


71.4±8.5 


50.2±1.9 


69.6±0.1 


84.7±3.2 




(2.0±0.0) 


(2.0±Q.Q) 


(2.3±0.5) 


(13±2.h) 


(6.0±1.Q) 


(36±o.o) 


(5.5±0.7) 


sparseFEM-2 


89.9±0.4 


98.3±0.0 


84.8±12 


70.1±12.2 


48.4±3.0 


67.5±1.6 


82.8±9.1 


sparseFEM-3 


(4.o±o.o) 


(4.0±0.0) 


(2.0±0.6) 


^±3.6; 


(6.6±0.7) 


(36±.o.o) 


(15.5±16) 




97.8±0.0 


82.9±12 


72.0±4.3 


48.2±2.7 




79.1±7.4 




(2.0±0.3) 


(2.0±0.0) 


(2.0±Q.Q) 


(io±2.8) 


(7.0±0.0) 


(s&to.o; 




sparse-kmeans 


90.7 


94.9 


95.3 


79.2 


52.3 


71.4 


74.7 




(4.0) 


(13.0) 


(17.0) 


(16.0) 


(6.0) 


(36.0) 


(m; 


Clustvarsel 


96.0 


92.7 


71.1 


75.2 


48.6 


58.7 


48.3 




(3.0) 


(5.0) 


(m; 


0.0; 




fM0j 


(o; 


Selvarclust 


96.0 


94.4 


92.6 


92.1 


43.0 


56.4 


36.7 




(3.0) 


(5.0) 













Table 3: Clustering accuracies and their standard deviations (in percentage) on 7 UCI datasets (iris, wine, chironomus, zoo, glass, 
satimage, usps358) averaged on 20 trials. The average number of nonzero variables is reported in brackets. No standard deviation is 
reported for Clustvarsel/Selvarclust and sparse-kmeans since their initialization procedure is deterministic and always provides the 
same initial partition. 



divided in 10 classes corresponding to the digits from to 9. Each digit is a 16 x 16 
gray level image represented as a 256-dimensional vector. For this experiment, we 
extracted a subset of the data (n = 1 756) corresponding to the digits 3, 5 and 8 
which are the three most difficult digits to discriminate. This smaller dataset is 
hereafter called usps358. Figure E] depicts the group mean images obtained from the 
true labels in the usps358 dataset. For this experiment, we used the three sparse- 
FEM algorithms with the model and the level of sparsity selected in the previous 
experiment for this data set. For Clustvarsel, Selvarclust and sparse-kmeans, the 
level of sparsity was again selected with their own selection procedure. 

Figures [3] illustrates, as images, the features selected respectively by sparse- 
kmeans (Figure [3Ja), Clustvarsel (Figure EJb) and Selvarclust (Figure Elc). In Fig- 
ure [3] a, the weight assigned by sparse-kmeans to each feature is represented by gray 
levels: lighter is the pixel, weaker is the absolute value of the weight of the associated 
feature. For Clustvarsel and Selvarclust, only the selected variables are depicted and 
are associated to black pixels as it is illustrated in Figures EJb and[3Jc respectively. 
These representations are associated to the following clustering accuracies 74.7%, 
48.3% and 36.7% for sparse-kmeans, Clustvarsel and Selvarclust respectively. For 
the 3 sparseFEM algorithms, we superimposed in a same figure the absolute values of 
the loadings of the two discriminative axes fitted by the sparseFEM- 1, sparseFEM-2 
and sparseFEM-3 procedures. The associated clustering accuracies are respectively 
84.7%, 82.8% and 79.1%. 

First of all, it appears that Clustvarsel and Selvarclust select significantly fewer 
variables than both sparse-kmeans or the sparseFEM procedures. Furthermore, 
most of the selected variables by Clustvarsel and Selvarclust turn out to be irrelevant 
to discriminate the digit 3 from the digits 5 and 8. For instance, in Figures EJb 
and[3Jc, we can observe that the black pixels located in right bottom corner, do not 
correspond to any discriminative variable. This certainly explain the poor clustering 
performances (48.3% for Clustvarsel and 36.7% for Selvarclust) observed on this data 
set for these methods. On the contrary, sparse-kmeans turns out to perform well 
in term of clustering performance (74.7% of clustering accuracy). Nevertheless, the 
number of selected variables remains higher (213 selected variables amongst 256 
original ones) than we would expect to ease the interpretation of results. Finally, 
sparseFEM- 1 and sparseFEM-2 seem to answer quite well to both the clustering task 
and the task of feature selection. Indeed, on the one hand, the subset of selected 
pixels remains small for both algorithms: 6 and 15 pixels are selected amongst 256 
for sparseFEM-1 and sparseFEM-2 respectively. Furthermore, the selected pixels 
appear to be relevant to discriminate the classes associated with the three digits. 
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(a) (b) (c) 

Figure 2: Group means obtained from the true labels in the USP358 datasets. 



(a) Sparse-kmeans 



(b) Clustvarsel 



(c) Selvarclust 



Figure 3: Variable selection obtained from (a) the sparse-kmeans algorithm, (b) the 
Clustvarsel approach and (c) the Selvarclust approach. 



(a) sparseFEM-1 



■ 

(b) sparseFEM-2 



(c) sparseFEM-3 



Figure 4: Variable selection obtained from (a) the sparseFEM-1, (b) the sparseFEM- 
2 and (c) the sparseFEM-3 procedures with sparsity levels selected by the penal- 
ized BIC. 
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Approaches: Procedure time (sec) 



Approaches: Procedure time (sec) 



sparseFEM-1 
sparseFEM-2 
sparseFEM-3 



729.04 
387.12 
409.61 



sparse-kmeans 

Clustvarsel 

Selvarclust 



1 567.75 

2 957.70 
9 257.10 



Table 4: Computing times for the 3 versions of the sparseFEM algorithm, sparse- 
kmeans, Clustvarsel and Selvarclust on the USPS358 data (for a given model and 
with A and K fixed). 

For instance, the darker pixel on the bottom right corner of Figure HJb discriminates 
the digit 8 from the digits 3 and 5. On the other hand, and certainly due to 
this relevant selection of variables, both algorithms perform particularly well on 
this high- dimensional data set (84.7% for sparseFEM-1 and 82.8% for sparseFEM- 
2). However, on this data set, the sparseFEM-3 procedure shows a disappointing 
behavior regarding the variable selection even though its clustering performance 
remains satisfying. The fact that sparseFEM-3 succeeds in clustering the data set 
even with a bad selection of variables is certainly due to the nature of the DLM 
model which models also the non discriminative information through the parameter 



Table @] presents the computing time of the studied clustering methods (for a 
given model and with A and K fixed) for clustering the usps358 data set. As we can 
remark, our procedures are much faster than the sparse-kmeans, Clustvarsel and 
Selvarclust algorithms. Consequently, the sparseFEM algorithms appear once again 
to be good compromises, in practice, to cluster high- dimensional data and select a 
set of discriminative variables in a reasonable time. 

5 Application to the segmentation of hyperspeo 
tral images 

Here, we propose to use sparseFEM to segment hyperspectral images of the Mar- 
tian surface. Visible and near infrared imaging spectroscopy is a key remote sensing 
technique to study the system of the planets. Imaging spectrometers, which are in- 
board of an increasing number of satellites, provide high-dimensional hyperspectral 
images. In March 2004, the OMEGA instrument (Mars Express, ESA) [4] has col- 
lected 310 Gbytes of raw images. The OMEGA imaging spectrometer has mapped 
the Martian surface with a spatial resolution varying between 300 to 3000 meters 
depending on the spacecraft altitude. It acquired for each resolved pixel the spec- 
trum from 0.36 to 5.2 At-m in 256 contiguous spectral channels. OMEGA is designed 
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Figure 5: Image of the studied zone of the Martian surface. 





Expert segmentation SparseFEM segmentation 

Figure 7: Segmentation of the hyperspectral image of the Martian surface using a 
physical model build by experts (left) and sparseFEM (right). 



to characterize the composition of surface materials, discriminating between vari- 
ous classes of silicates, hydrated minerals, oxides and carbonates, organic frosts and 
ices. For this experiment, a 300 x 128 image of the Martian surface is considered 
and a 256-dimensional spectral observation is therefore associated to each of the 
38 400 pixels. Figure El presents an image of the studied zone and Figure E] shows 
some of the 38 400 measured spectra. According to the experts, there are K = 5 
mineralogical classes to identify. 

The sparseFEM-1 algorithm was applied to this dataset using the model DLM^.^j 
and a sparsity ratio equals to 0.1 (it refers to the ratio of the t\ norm of the coeffi- 
cient vector relative to the norm at the full least square solution). The sparseFEM 
algorithm was initialized with the results of the Fisher-EM algorithm and the whole 
segmentation process took 18 hours on a 2.6 Ghz computer. Figure [7J presents, on 
the right panel, the segmentation into 5 mineralogical classes of the studied zone 
with the sparseFEM algorithm. In comparison, the left panel of Figure [7J shows the 
segmentation obtained by experts of the domain using a physical model. It first ap- 
pears that the two segmentations agree globally on the mineralogical nature of the 
surface of the studied zone (60.30% of agreement). We recall that both segmenta- 
tions do not exploit the spatial information. When looking at the top-right quarter 
of the image, we can notice that sparseFEM seems to provide a finer segmentation 
than the segmentation based on the physical model. Indeed, sparseFEM segments 
better than the physical model the fine "rivers" which can be seen on Figure 

Finally, Figure [8] shows the mean spectra of the 5 groups formed by sparseFEM 
and the selection of the discriminative wavelengths. SparseFEM selected 8 original 
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variables (wavelengths) as discriminative variables, i.e. the rows associated to these 
variables were non-zero in the loading matrix U. Looking closely at the selection, 
we indeed notice that the first selected variable (from left to right) discriminates the 
blue group from the others. The second selected variable discriminates the red and 
green groups from the black, blue and light blue groups whereas the third selected 
variable allows to discriminate the red, green and black groups from the blue and 
light blue groups. Similarly, the fourth and fifth selected variables discriminate the 
red and green groups from the black, blue and light blue groups whereas the sixth, 
seventh and eighth selected variable allows to discriminate the red, green and light 
blue groups from the blue and black groups. 

A possible interest of such a selection could be the measurement of only a tens 
of wavelengths for future acquisitions instead of the 256 current ones for a result 
expected to be similar. This could in particular reduce the acquisition time for each 
pixel from a few tens of seconds to less than one second. 

6 Conclusion 

This article has focused on variable selection for clustering with the Fisher-EM al- 
gorithm which has been recently proposed in [BJ. The aim of this work was to 
introduce sparsity in the Fisher-EM algorithm and thus select the discriminative 
variables among the set of original variables. We have proposed three different pro- 
cedures based on a £i-penalty term. Experiments on simulations and real data sets 
have shown that the three sparse versions of the Fisher-EM algorithm are highly 
competitive with existing approaches of the literature. In particular, the sparseFEM 
procedures present several assets regarding existing approaches. On the one hand, 
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they tend to select an intermediate number of discriminative variables whereas ex- 
isting approaches tend to select either too few (Clustvarsel and Selvarclust) or too 
much variables (sparse- kmeans). On the other hand, the sparseFEM procedures 
perform both the clustering and the variable selection in a reasonable time compar- 
ing to existing approaches in the case of high- dimensional data. The sparseFEM 
algorithms have been also applied with success to the segmentation of hyperspectral 
images of the planet Mars and relevant parts of the spectra which well discriminate 
the groups have been identified. 

Among the possible extensions of this work, it may be first interesting to use 
different ^i-penalty values according to the relevance of each discriminative axis 
estimated in the Fisher-EM algorithm. Such an approach could identify different 
levels of relevancy among the original variables. Second, we used in this work a pe- 
nalized BIC criterion to select the sparsity level by evaluating the model complexity 
in regard to the non-zero values as proposed by [2Z] • Although Zou et al. [42 J showed 
that the number of non-zero coefficients is an unbiased estimate of the degrees of 
freedom and is asymptotically consistent in the case of penalized regression prob- 
lem, this result has no theoretical justification in the penalized GMM context. It 
would be therefore interesting to obtain theoretical guarantees of such a result in 
our context. Finally, since the ICL criterion [5] is also used to select the number 
of components, it would be a natural extension to consider a penalized ICL for 
selecting the sparsity level in the sparseFEM algorithms. 
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